Global genome compendium curated from human stool samples and strain isolates from 31 countries across six continents. This catalog includes 289,232 prokaryotic genomes clustered into 4,744 species representatives. These genomes encode >170 million protein sequences, collated in the Unified Human Gastrointestinal Protein (UHGP) catalog. The UHGP more than doubles the number of gut proteins in comparison to those present in the Integrated Gene Catalog. More than 70% of the UHGG species lack cultured representatives, and 40% of the UHGP lack functional annotations (Almeida et al. 2021).
Human Reference Gut Microbiome - Korean, India, Japanese data-sets (HRGM-KIJ)
An addition of under-represented Asian countries to the UHGG v1 , with 90, 110, and 645 human microbiome samples collected from Korea, India, and Japan, respectively. Genome assembly produced 29,082 prokaryotic genomes from 845 fecal metagenomes (Merrill et al., n.d.).
Early-Life Gut Genomes (ELGG)
ELGG contains 32,277 genomes representing 2,172 species from 6,122 fecal metagenomes collected from children under 3 years old spanning delivery mode, gestational age, feeding pattern, and geography. The ELGG substantially expanded the phylogenetic diversity by 38% over the isolate microbial genomes, and the genomic landscape of the early-life microbiome by increasing recruitment of metagenomic reads to 82.8%. More than 60% of the ELGG species lack an isolate representative. The conspecific genomes of the most abundant species from children differed in gene diversity and functions compared to adults. The ELGG genomes encode over 80 million protein sequences, forming the Early-Life Gut Proteins (ELGP) catalog with over four million protein clusters, 29.5% of which lacked functional annotations (Zeng et al. 2022).
Hadza
Ultra-deep metagenomic sequencing and strain cultivation on 351 fecal samples from the Hadza, hunter-gatherers in Tanzania, and comparative populations in Nepal and California. This data set contains 94,971 genomes of bacteria, archaea, bacteriophages, and eukaryotes, 43% of which are absent from existing unified datasets (Merrill et al., n.d.).
RUMMETA
A ruminant gastrointestinal microbial reference catalog with 154,335,274 non-redundant genes and 8,745 uncultured species from over 10,373 metagenome-assembled genomes (MAGs). This genome compendium was built from 370 gastrointestinal (GIT) content samples, spanning 10 different GIT regions sampled from seven ruminant species (dairy cattle, Bos taurus; water buffalo, Bubalus bubalis; yak, Bos grunniens; goat, Capra aegagrus; sheep, Ovis aries; roe deer, Capreolus pygargus; water deer, Hydropotes inermis) (Xie et al. 2021).
Rumen-Uncultured Genomes (RUGs)
5,588 rumen MAGs collected from 283 ruminant beef cattle and 24 rumen fluid samples from six Boran (Bos indicus), indigenous Zebu cattle from sub-Saharan Africa (Wilkinson et al. 2020; Stewart et al. 2019).
EuPathDB
EuPathDB includes 11 databases (AmoebaDB, CryptoDB, FungiDB, GiardiaDB, HostDB, MicrosporidiaDB, OrthoMCLDB, PiroplasmaDB, PlasmoDB, ToxoDB, TrichDB, TriTrypDB, and VectorBase) which together represent 689 eukaryotic microorganisms. This data includes both known pathogens and other closely related non-infectious eukaryotic species(Aurrecoechea et al. 2013).
Metagenomic Gut Virus catalog (MGV)
189,680 viral genomes from 11,810 publicly available human stool metagenomes with 11,837,198 non-redundant proteins identified. Over 75% of genomes represent double-stranded DNA phages that infect members of the Bacteroidia and Clostridia classes. Sequence clustering reveals 54,118 candidate viral species, 92% of which were not found in existing databases. The Metagenomic Gut Virus catalog improves detection of viruses in stool metagenomes and accounts for nearly 40% of CRISPR spacers found in human gut Bacteria and Archaea (Nayfach et al. 2021).
WormBase ParaSite (WBPS17)
Databases consisting of 210 genomes, representing 169 distinct species from the Nemotoda (Roundworm) and Plathelminthes (Flatworm) phyla (Harris et al. 2020).
Data sources summary
Catalog
Release
Domain(s)
Samples
Genomes
Proteins (95%)
DOI
catalog
UHGG/UHGP
Mgnify Unified Human Gastrointestinal Genomes/Proteins v2.0.1
#' https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01078-x #' An integrated gene catalog and over 10,000 metagenome-assembled genomes #' from the gastrointestinal microbiome of ruminants# http://rummeta.njau.edu.cn/rumment/download/downloadFile?file=RGMGC.geneSet.faa.gzwget_download_slurm(jobname ="RUMMETA",download_link ="http://rummeta.njau.edu.cn/rumment/download/RGMGC.geneSet.faa.gz", slurm_out ="/central/scratch/jbok/slurmdump/",output_dir = protein_catalogs )
At 95% similarity, our catalog contains 164,949,326 unique protein sequences. This compendium represents the global human and ruminant gut microbiome genomic repertoire, in addition to common eukaryotic pathogens.
Almeida, Alexandre, Stephen Nayfach, Miguel Boland, Francesco Strozzi, Martin Beracochea, Zhou Jason Shi, Katherine S. Pollard, et al. 2021. “A Unified Catalog of 204,938 Reference Genomes from the Human Gut Microbiome.”Nature Biotechnology 39 (1): 105–14. https://doi.org/10.1038/s41587-020-0603-3.
Aurrecoechea, Cristina, Ana Barreto, John Brestelli, Brian P. Brunk, Shon Cade, Ryan Doherty, Steve Fischer, et al. 2013. “EuPathDB: The Eukaryotic Pathogen Database.”Nucleic Acids Research 41 (D1): D684–91. https://doi.org/10.1093/nar/gks1113.
Harris, Todd W, Valerio Arnaboldi, Scott Cain, Juancarlos Chan, Wen J Chen, Jaehyoung Cho, Paul Davis, et al. 2020. “WormBase: A Modern Model Organism Information Resource.”Nucleic Acids Research 48 (D1): D762–67. https://doi.org/10.1093/nar/gkz920.
Merrill, Bryan D., Matthew M. Carter, Matthew R. Olm, Dylan Dahan, Surya Tripathi, Sean P. Spencer, Brian Yu, et al. n.d. “Ultra-Deep Sequencing of Hadza Hunter-Gatherers Recovers Vanishing Gut Microbes.”https://doi.org/10.1101/2022.03.30.486478.
Nayfach, Stephen, David Páez-Espino, Lee Call, Soo Jen Low, Hila Sberro, Natalia N. Ivanova, Amy D. Proal, et al. 2021. “Metagenomic Compendium of 189,680 DNA Viruses from the Human Gut Microbiome.”Nature Microbiology 6 (7): 960–70. https://doi.org/10.1038/s41564-021-00928-6.
Stewart, Robert D., Marc D. Auffret, Amanda Warr, Alan W. Walker, Rainer Roehe, and Mick Watson. 2019. “Compendium of 4,941 Rumen Metagenome-Assembled Genomes for Rumen Microbiome Biology and Enzyme Discovery.”Nature Biotechnology 37 (8): 953–61. https://doi.org/10.1038/s41587-019-0202-3.
Wilkinson, Toby, Daniel Korir, Moses Ogugo, Robert D. Stewart, Mick Watson, Edith Paxton, John Goopy, and Christelle Robert. 2020. “1200 High-Quality Metagenome-Assembled Genomes from the Rumen of African Cattle and Their Relevance in the Context of Sub-Optimal Feeding.”Genome Biology 21 (1): 229. https://doi.org/10.1186/s13059-020-02144-7.
Xie, Fei, Wei Jin, Huazhe Si, Yuan Yuan, Ye Tao, Junhua Liu, Xiaoxu Wang, et al. 2021. “An Integrated Gene Catalog and over 10,000 Metagenome-Assembled Genomes from the Gastrointestinal Microbiome of Ruminants.”Microbiome 9 (1): 137. https://doi.org/10.1186/s40168-021-01078-x.
Zeng, Shuqin, Dhrati Patangia, Alexandre Almeida, Zhemin Zhou, Dezhi Mu, R. Paul Ross, Catherine Stanton, and Shaopu Wang. 2022. “A Compendium of 32,277 Metagenome-Assembled Genomes and over 80 Million Genes from the Early-Life Human Gut Microbiome.”Nature Communications 13 (1): 5139. https://doi.org/10.1038/s41467-022-32805-z.
Source Code
---title: "Protein Catalog Curation"editor: visualauthor: "Joe Boktor"date: '2023-01-24'format: html: font-family: helvetica neue page-layout: full toc: true toc-location: left toc-depth: 3 self-contained: true code-fold: true code-tools: true fig-align: centerbibliography: references.bib---## Data sources#### **Unified Human Gastrointestinal Genome (UHGG/UHGP)**Global genome compendium curated from human stool samples and strain isolates from 31 countries across six continents. This catalog includes 289,232 prokaryotic genomes clustered into 4,744 species representatives. These genomes encode \>170 million protein sequences, collated in the Unified Human Gastrointestinal Protein (UHGP) catalog. The UHGP more than doubles the number of gut proteins in comparison to those present in the Integrated Gene Catalog. More than 70% of the UHGG species lack cultured representatives, and 40% of the UHGP lack functional annotations [@almeida2021].#### **Human Reference Gut Microbiome - Korean, India, Japanese data-sets (HRGM-KIJ)**An addition of under-represented Asian countries to the UHGG v1 , with 90, 110, and 645 human microbiome samples collected from Korea, India, and Japan, respectively. Genome assembly produced 29,082 prokaryotic genomes from 845 fecal metagenomes [@merrill].#### **Early-Life Gut Genomes (ELGG)**ELGG contains 32,277 genomes representing 2,172 species from 6,122 fecal metagenomes collected from children under 3 years old spanning delivery mode, gestational age, feeding pattern, and geography. The ELGG substantially expanded the phylogenetic diversity by 38% over the isolate microbial genomes, and the genomic landscape of the early-life microbiome by increasing recruitment of metagenomic reads to 82.8%. More than 60% of the ELGG species lack an isolate representative. The conspecific genomes of the most abundant species from children differed in gene diversity and functions compared to adults. The ELGG genomes encode over 80 million protein sequences, forming the Early-Life Gut Proteins (ELGP) catalog with over four million protein clusters, 29.5% of which lacked functional annotations [@zeng2022].#### **Hadza**Ultra-deep metagenomic sequencing and strain cultivation on 351 fecal samples from the Hadza, hunter-gatherers in Tanzania, and comparative populations in Nepal and California. This data set contains 94,971 genomes of bacteria, archaea, bacteriophages, and eukaryotes, 43% of which are absent from existing unified datasets [@merrill].#### **RUMMETA**A ruminant gastrointestinal microbial reference catalog with 154,335,274 non-redundant genes and 8,745 uncultured species from over 10,373 metagenome-assembled genomes (MAGs). This genome compendium was built from 370 gastrointestinal (GIT) content samples, spanning 10 different GIT regions sampled from seven ruminant species (dairy cattle, *Bos taurus*; water buffalo, *Bubalus bubalis*; yak, *Bos grunniens*; goat, *Capra aegagrus*; sheep, *Ovis aries*; roe deer, *Capreolus pygargus*; water deer, *Hydropotes inermis*) [@xie2021].#### **Rumen-Uncultured Genomes (RUGs)**5,588 rumen MAGs collected from 283 ruminant beef cattle and 24 rumen fluid samples from six Boran (*Bos indicus*), indigenous Zebu cattle from sub-Saharan Africa [@wilkinson2020; @stewart2019].#### **EuPathDB**EuPathDB includes 11 databases (AmoebaDB, CryptoDB, FungiDB, GiardiaDB, HostDB, MicrosporidiaDB, OrthoMCLDB, PiroplasmaDB, PlasmoDB, ToxoDB, TrichDB, TriTrypDB, and VectorBase) which together represent 689 eukaryotic microorganisms. This data includes both known pathogens and other closely related non-infectious eukaryotic species[@aurrecoechea2013].#### **Metagenomic Gut Virus catalog (MGV)**189,680 viral genomes from 11,810 publicly available human stool metagenomes with 11,837,198 non-redundant proteins identified. Over 75% of genomes represent double-stranded DNA phages that infect members of the Bacteroidia and Clostridia classes. Sequence clustering reveals 54,118 candidate viral species, 92% of which were not found in existing databases. The Metagenomic Gut Virus catalog improves detection of viruses in stool metagenomes and accounts for nearly 40% of CRISPR spacers found in human gut Bacteria and Archaea [@nayfach2021].#### **WormBase ParaSite (WBPS17)**Databases consisting of 210 genomes, representing 169 distinct species from the Nemotoda (Roundworm) and Plathelminthes (Flatworm) phyla [@harris2020].## Data sources summary+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| Catalog | Release | Domain(s) | Samples | Genomes | Proteins (95%) | DOI | catalog |+====================================================================+=========+===========+====================================+=========+================+================================================================================================+===================================================================================================================================+| **UHGG/UHGP** | 2022 | Bacteria | 25,088 + cultured genome resources | 289,232 | 18,482,368 | [link](https://www.nature.com/articles/s41587-020-0603-3) | [link](http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.1/) || | | | | | | | || Mgnify Unified Human Gastrointestinal Genomes/Proteins v2.0.1 | | Archaea | | | | [link](https://www.nature.com/articles/s41591-019-0559-3) | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **HRGM-KIJ** | 2021 | Bacteria | 845 | 29,082 | 6,312,390 | [link](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00950-7) | [link](https://www.mbiomenet.org/HRGM/listdir.php?directory=data/protein_catalog) || | | | | | | | || Human Reference Gut Microbiome (Korea, India, Japan) | | Archaea | | | | | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **ELGG/ELGP** | 2022 | Bacteria | 32,277 | 6,122 | 3,879,904 | [link](https://www.nature.com/articles/s41467-022-32805-z) | [link](https://zenodo.org/record/6969520/files/ELGP_95.faa.gz) || | | | | | | | || Early-Life Gut Genomes/Proteins | | Archaea | | | | | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **Hadza** | 2022 | Bacteria | 388 | 94,971 | 32,431,351 | [link](https://www.biorxiv.org/content/10.1101/2022.03.30.486478v2.full) | [link](https://www.ebi.ac.uk/ebisearch/search/cross-references?db=project&id=PRJEB49206&ref=genomes&requestFrom=relatedData-xref) || | | | | | | | || African Hadza tribe and comparative Nepali, and California samples | | Archaea | | | | | || | | | | | | | || | | Virus | | | | | || | | | | | | | || | | Eukaryote | | | | | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **RUMMETA** | 2021 | Bacteria | 370 | 10,373 | 109,586,913 | [link](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01078-x#MOESM5) | [link](http://rummeta.njau.edu.cn/rumment/download/downloadFile?file=RGMGC.geneSet.faa.gz) || | | | | | | | || 10 GIT regions of seven ruminant species | | Archaeal | | | | | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **RUGs** | 2019 | Bacteria | 207 | 5,588 | 5,502,611 | [link](https://www.nature.com/articles/s41587-019-0202-3) | [link](http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/cow-rumen/v1.0/protein_catalogue/) || | | | | | | | || MGnify Cow Rumen v1.0 | | Archaeal | | | | [link](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02144-7) | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **MGV** | 2021 | Virus | \_ | 189,680 | 1,725,656 | [link](https://www.nature.com/articles/s41564-021-00928-6) | [link](https://portal.nersc.gov/MGV/MGV_v1.0_2021_07_08/) || | | | | | | | || Metagenomic Gut Virus Catalog | | | | | | | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **WormBase ParaSite** (WBPS17) | 2019 | Eukaryote | \_ | 210 | 10,328,252 | [link](https://academic.oup.com/nar/article/48/D1/D762/5603222) | [link](https://parasite.wormbase.org/ftp.html) |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+| **EuPathDB** | 2022 | Eukaryote | \_ | 689 | 3,456,934 | [link](https://academic.oup.com/nar/article/41/D1/D684/1059841) | [link](https://veupathdb.org/veupathdb/app/downloads/Current_Release/) || | | | | | | | || eukaryotic pathogen database (Release 61) | | | | | | | |+--------------------------------------------------------------------+---------+-----------+------------------------------------+---------+----------------+------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+Analysis Setup```{r}#| warning: falselibrary(tidyverse)library(reticulate)library(magrittr)library(glue)library(seqinr)library(future)library(future.batchtools)library(furrr)library(fs)library(tictoc)library(listenv)library(progress)library(tibble)library(protr)library(kableExtra)library(plotly)library(ggsci)library(data.table)library(ggside)tmpdir <-"/central/scratch/jbok/tmp"homedir <-"/central/groups/MazmanianLab/joeB"wkdir <-glue("{homedir}/","Microbiota-Immunomodulation/Microbiota-Immunomodulation")src_dir <-glue("{wkdir}/notebooks")source(glue("{src_dir}/R-scripts/helpers_general.R"))source(glue("{src_dir}/R-scripts/helpers_sequence-screens.R"))protein_catalogs <-glue("{homedir}/Downloads/protein_catalogs")```Visualizing catalog summary statistics```{r, eval=FALSE}#| fig-cap: Catalog summary statisticscatalog_stats <-tribble(~catalog, ~Bacteria, ~Archaea, ~Virus, ~Eukaryote, ~Genomes, ~Samples, ~proteins95,"UHGP", 1, 1, 0, 0, 289232, 25088, 18482368,"KIJ", 1, 1, 0, 0, 29082, 845, 6312390,"ELGP", 1, 1, 0, 0, 6122, 32277, 3879904,"Hadza", 1, 1, 1, 1, 94971, 388, 32431351,"RUMMETA", 1, 1, 0, 0, 10373, 370, 109586913,"RUGs", 1, 1, 0, 0, 5588, 207, 5502611,"MGV", 0, 0, 1, 0, 189680, NA, 1725656,"WormBase", 0, 0, 0, 1, 210, NA, 10328252,"EuPathDB", 0, 0, 0, 1, 689, NA, 3456934)p_bar1 <- catalog_stats %>%ggplot(aes(y =reorder(catalog, proteins95), x = proteins95)) +geom_col(width =0.6, fill ="lightblue") +labs(y =NULL, x ="No. of proteins in each catalog (95% similarity)") +theme_light() +scale_x_continuous(expand =c(0, 0)) +theme(panel.grid.major.y =element_blank())p_bar2 <- catalog_stats %>%ggplot(aes(y =reorder(catalog, Genomes), x = Genomes)) +geom_col(width =0.6, fill ="lightblue") +labs(y =NULL, x ="No. of Assembled Genomes") +theme_light() +scale_x_continuous(expand =c(0, 0)) +theme(panel.grid.major.y =element_blank())p_bar3 <- catalog_stats %>%drop_na(Samples) %>%ggplot(aes(y =reorder(catalog, Samples), x = Samples)) +geom_col(width =0.6, fill ="lightblue") +labs(y =NULL, x ="No. of Mammalian Samples") +theme_light() +scale_x_continuous(expand =c(0, 0)) +theme(panel.grid.major.y =element_blank())patch1 <- (p_bar1 / p_bar2 / p_bar3) +plot_layout(heights =c(1, 1, 0.666))ggsave(patch1, filename =glue("{wkdir}/figures/protein_catalog/summary_stats_barplot.png"),width =6, height =8, dpi =300)``````{r, eval=FALSE}p_donut <- catalog_stats %>%summarise_at(vars(Bacteria, Archaea, Virus, Eukaryote), sum) %>%pivot_longer(cols =everything(), names_to ="category", values_to ="count") %>%mutate(fraction = count /sum(count)) %>%mutate(ymax =cumsum(fraction)) %>%mutate(ymin =c(0, head(ymax, n =-1))) %>%mutate(labelPosition = (ymax + ymin) /2) %>%mutate(label =paste0(category, "\n catalogs: ", count)) %>%ggplot(aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +geom_rect() +geom_text( x=2, aes(y=labelPosition, label=label, color=category), size=5 ) +scale_fill_npg() +scale_color_npg() +coord_polar(theta="y") +xlim(c(-1, 4)) +theme_void() +theme(legend.position ="none")ggsave(p_donut, filename =glue("{wkdir}/figures/protein_catalog/domain_donut.png"),width =8, height =8, dpi =300)```{width=60%}## Downloading and Formatting Sequence Data#### UHGP v 2.0```{r, eval=FALSE}uhgp_ftp_links <-paste0("http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/","human-gut/v2.0.1/protein_catalogue/uhgp-", c(100, 95, 90, 50), ".tar.gz" )for (f in uhgp_ftp_links) { jname <- fs::path_ext_remove(basename(f))wget_download_slurm(jobname = jname,slurm_out ="/central/scratch/jbok/slurmdump/",download_link = f,output_dir = protein_catalogs )}``````{r, eval=FALSE}# Unzip uhgp-100.tar.gzslurm_shell_do(cmd =glue("tar -xvzf {protein_catalogs}/UHGP_v2.0.1/uhgp-95.tar.gz"," -C {protein_catalogs}/UHGP_v2.0.1/"),memory ="200G",walltime =144000)```#### ELGG```{r, eval=FALSE}# Download 95% ELGG protein faa via https://zenodo.org/record/6969520/files/ELGP_95.faa.gz?download=1# wget https://zenodo.org/record/6969520/files/ELGP_95.faa.gz wget_download_slurm(jobname ="ELGG",download_link ="https://zenodo.org/record/6969520/files/ELGP_95.faa.gz",slurm_out ="/central/scratch/jbok/slurmdump/",output_dir = protein_catalogs )``````{r, eval=FALSE}# Unzip ELGP_95.faa.gzslurm_shell_do(cmd =glue("gunzip {protein_catalogs}/ELGP_95.faa.gz"),memory ="50G",walltime =36000)# shell_do("gunzip /central/groups/MazmanianLab/joeB/Downloads/protein_catalogs/ELGP_95.faa.gz")```#### KIJ```{r, eval=FALSE}# HRGM (KIJ 100 only) using 110 India, 645 Japan, and 90 Korean fecal samples# https://www.mbiomenet.org/HRGM/data/protein_catalog/2.KIJ_CD-HIT-100_Proteins/KIJ_CD-HIT-100_Proteins.faa.gzwget_download_slurm(jobname ="HRGM_KIJ100",download_link ="https://www.mbiomenet.org/HRGM/data/protein_catalog/2.KIJ_CD-HIT-100_Proteins/KIJ_CD-HIT-100_Proteins.faa.gz",slurm_out ="/central/scratch/jbok/slurmdump/",output_dir = protein_catalogs)``````{r, eval=FALSE}# Unzipslurm_shell_do(cmd =glue("gunzip {protein_catalogs}/KIJ_CD-HIT-100_Proteins.faa.gz"),memory ="50G",walltime =36000)# shell_do("gunzip /central/groups/MazmanianLab/joeB/Downloads/protein_catalogs/KIJ_CD-HIT-100_Proteins.faa.gz")```#### Hadza/Nepali Hunter-Gatherers```{r, eval=FALSE}# Downloading Hadza Mag metadatashell_do(glue("mkdir -p {wkdir}/data/input/protein_catalog_metadata"))shell_do(glue("wget -P {wkdir}/data/input/protein_catalog_metadata/"," https://www.biorxiv.org/content/biorxiv/early/2022/10/07/2022.03.30.486478/DC2/embed/media-2.xlsx",))shell_do(glue("mv {wkdir}/data/input/protein_catalog_metadata/media-2.xlsx"," {wkdir}/data/input/protein_catalog_metadata/Hadza-MAG-metadata.xlsx"))``````{r, eval=FALSE}hadza_meta_PRJEB49206 <-read.delim(glue("{wkdir}/data/input/protein_catalog_metadata/","filereport_analysis_PRJEB49206_tsv.txt" ),stringsAsFactors =FALSE, header =TRUE )# get list of files that don't currently existhadza_files_unprocessed_lgr <- hadza_meta_PRJEB49206$submitted_ftp %>% purrr::map(~!(file.exists(glue("{hadza_metagenomes_dir}/{basename(.)}")) |file.exists(glue("{hadza_metagenomes_dir}/","{fs::path_ext_remove(basename(.))}" )) ))hadza_files_unprocessed <- hadza_meta_PRJEB49206$submitted_ftp %>%keep(unlist(hadza_files_unprocessed_lgr))# Initiate future.batchtools backend for parallel processingfuture::plan( future.batchtools::batchtools_slurm,template =glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),resources =list(name ="hadza_agg-download",memory ="1G",ncpus =2,walltime =360000 ))# Chunk files (100 per job) and downloadtic()n_jobs <-ceiling(length(hadza_files_unprocessed) /100)download_runs <-listenv()for (job in1:n_jobs) { input_chunk <-chunk_func(hadza_files_unprocessed, n_jobs)[[job]] download_runs[[job]] %<-%wget_path_list(download_list = input_chunk,download_dir ="/central/scratch/jbok/PRJEB49206_HADZA" )}toc()# list.files("/central/scratch/jbok/PRJEB49206_HADZA") %>% length``````{r, eval=FALSE}#______________________________________________________________________# Prodigal on Hadza MAGs ----# Initiate future.batchtools backend for parallel processingfuture::plan( future.batchtools::batchtools_slurm,template =glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),resources =list(name ="prodigal-Hadza-MAGs",memory ="5G",ncpus =1,walltime =360000 ))# select input genome fasta files paths for prodigalhadza_proteins_dir <-glue("{homedir}/Downloads/","protein_catalogs/hadza_proteins")hadza_mag_paths <-list.files(path ="/central/scratch/jbok/PRJEB49206_HADZA",full.names =TRUE )hadza_mag_unprocessed_lgr <- hadza_mag_paths %>% purrr::map( ~!file.exists(glue("{hadza_proteins_dir}/{basename(.)}")))hadza_mag_paths_to_process <- hadza_mag_paths %>%keep(unlist(hadza_mag_unprocessed_lgr))# generate list of desired fasta outputshadza_proteins_paths <- purrr::map( hadza_mag_paths_to_process,~glue("{hadza_proteins_dir}/{fs::path_ext_remove(basename(.))}") )# Chunk files (500 per job) and downloadtic()n_jobs <-ceiling(length(hadza_mag_paths_to_process) /25)prodigal_runs <-listenv()for (job in1:n_jobs) { input_list <-chunk_func(hadza_mag_paths_to_process, n_jobs)[[job]] output_list <-chunk_func(hadza_proteins_paths, n_jobs)[[job]] prodigal_runs[[job]] %<-%shell_do_prodigal_list(input_genome_list = input_list,output_fasta_list = output_list )}toc()hadza_files <-list.files(glue("{protein_catalogs}/hadza_proteins"),full.names =TRUE )hadza_files %>% length``````{r, eval=FALSE}pb <- progress_bar$new(total =length(hadza_files),format ="[:bar] :current/:total (:percent)" )for (f in hadza_files){ pb$tick()shell_do(glue("cat {f} >> {protein_catalogs}/hadza.fasta" ))}```#### Ruminants - RUMMETA```{r, eval=FALSE}#' https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01078-x #' An integrated gene catalog and over 10,000 metagenome-assembled genomes #' from the gastrointestinal microbiome of ruminants# http://rummeta.njau.edu.cn/rumment/download/downloadFile?file=RGMGC.geneSet.faa.gzwget_download_slurm(jobname ="RUMMETA",download_link ="http://rummeta.njau.edu.cn/rumment/download/RGMGC.geneSet.faa.gz", slurm_out ="/central/scratch/jbok/slurmdump/",output_dir = protein_catalogs )``````{r, eval=FALSE}slurm_shell_do(cmd =glue("gunzip {protein_catalogs}/RGMGC.geneSet.faa.gz"),memory ="50G",walltime =36000)```#### Ruminants - RUGs```{r, eval=FALSE}rug_ftp_links <-paste0("http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/","cow-rumen/v1.0/protein_catalogue/protein_catalogue-", c(100, 95, 90, 50), ".tar.gz")for (f in rug_ftp_links) { jname <- fs::path_ext_remove(basename(f))wget_download_slurm(jobname = jname,slurm_out ="/central/scratch/jbok/slurmdump/",download_link = f,output_dir =glue("{protein_catalogs}/cow-rumen-v1.0") )}``````{r, eval=FALSE}slurm_shell_do(cmd =glue("tar -xvzf {protein_catalogs}/cow-rumen-v1.0/protein_catalogue-95.tar.gz"," -C {protein_catalogs}/cow-rumen-v1.0/"),memory ="50G",walltime =36000)```#### MGV```{r, eval=FALSE}wget_download_slurm(jobname ="MGV_proteins",download_link ="https://portal.nersc.gov/MGV/MGV_v1.0_2021_07_08/mgv_proteins.faa", slurm_out ="/central/scratch/jbok/slurmdump/",output_dir = protein_catalogs )```#### VEuPathDB```{r, eval=FALSE}protein_catalogs <-glue("{homedir}/", "Downloads/protein_catalogs/")VEuPathDB_meta <-read.delim(glue("{wkdir}/data/input/protein_catalog_metadata/VEuPathDB_release-61_2022-12-15.txt"),stringsAsFactors =FALSE, header =TRUE )# checking length of dataVEuPathDB_links <- VEuPathDB_meta$protein_fasta_link %>%unique()VEuPathDB_links %>%length()for (f in VEuPathDB_links) {if (!file.exists(glue("{protein_catalogs}/VEuPathDB_v61/{basename(f)}"))){ jname <- fs::path_ext_remove(basename(f))wget_download_slurm(jobname = jname,slurm_out ="/central/scratch/jbok/slurmdump/",download_link = f,output_dir =glue("{protein_catalogs}/VEuPathDB_v61") )Sys.sleep(5) }}``````{r, eval=FALSE}# Some runs failed, collect and examine filepathsdownload_failed <-list()for (f in VEuPathDB_links) {if (!file.exists(glue("{protein_catalogs}/VEuPathDB_v61/{basename(f)}"))){ jname <- fs::path_ext_remove(basename(f)) download_failed[[jname]] <- f }}# rerun download failures, downloading genomes (nt) instead of proteins (aa)download_failed %<>% purrr::map(~gsub("AnnotatedProteins", "Genome", .))for (f in download_failed) {if (!file.exists(glue("{protein_catalogs}/VEuPathDB_v61/{basename(f)}"))){ jname <- fs::path_ext_remove(basename(f))wget_download_slurm(jobname = jname,slurm_out ="/central/scratch/jbok/slurmdump/",download_link = f,output_dir =glue("{protein_catalogs}/VEuPathDB_v61/Genomes/") )Sys.sleep(1) }}# Initiate future.batchtools backend for parallel processingfuture::plan( future.batchtools::batchtools_slurm,template =glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),resources =list(name ="prodigal-VEuPathDB-Genomes",memory ="5G",ncpus =1,walltime =36000 ))VEuPathDB_genome_paths <-list.files(glue("{homedir}/Downloads/protein_catalogs/VEuPathDB_v61/Genomes"),full.names =TRUE )for (genome_path in VEuPathDB_genome_paths) { output_file_name <-gsub("Genome", "AnnotatedProteins", basename(genome_path) ) output_file <-glue("{homedir}/Downloads/protein_catalogs/VEuPathDB_v61/{output_file_name}") fout %<-% {shell_do_prodigal(input_genome = genome_path,output_fasta = output_file ) }}``````{r, eval=FALSE}# concatenate individual fasta filesslurm_shell_do(cmd =glue("cat {protein_catalogs}/VEuPathDB_v61/*.f* >"," {protein_catalogs}/VEuPathDB_v61.fasta" ),memory ="10G")```#### WormBase```{r, eval=FALSE}wormbase_meta <-read.delim(glue("{wkdir}/data/input/protein_catalog_metadata/","2023-02-09_WormBase-genomes.txt" ),stringsAsFactors =FALSE, header =TRUE )for (f in wormbase_meta$Proteins) { jname <- fs::path_ext_remove(basename(f))if (!file.exists(glue("{protein_catalogs}/wormbase-v17.0/{basename(f)}"))){wget_download_slurm(jobname = jname,slurm_out ="/central/scratch/jbok/slurmdump/",download_link = f,output_dir =glue("{protein_catalogs}/wormbase-v17.0") )Sys.sleep(0.5) }}``````{r, eval=FALSE}slurm_shell_do(cmd =glue("gunzip {protein_catalogs}/wormbase-v17.0/*.gz"),memory ="5G",walltime =360000)``````{r, eval=FALSE}# concatenate individual fasta filesslurm_shell_do(cmd =glue("cat {protein_catalogs}/wormbase-v17.0/* >"," {protein_catalogs}/wormbase-v17.0.fasta" ),memory ="10G")```## Catalog AssemblyMMSeq2 95% similarity clustering of each catalog independently to remove redundant sequences```{r}# Utility function for running MMSeq2 clustering on the HPCslurm_shell_do_mmseq2_clust <-function(input_fasta, output,id_threshold =0.95,jobname =glue("MMSeq2-{rand_string()}"),memory ="120G", # per cpuncpus =12,walltime =36000,tmpdir ="/central/scratch/jbok/tmp",working_dir = wkdir) {source(glue("{working_dir}/notebooks/R-scripts/helpers_general.R")) mmseq_cmd <-glue("mmseqs easy-linclust --cov-mode 1 -c 0.8"," --kmer-per-seq 80"," --min-seq-id {id_threshold}"," --threads {ncpus}"," {input_fasta}"," {output}"," {tmpdir}" )slurm_shell_do(template_path =glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),cmd = mmseq_cmd,jobname = jobname,memory = memory,ncpus = ncpus,walltime = walltime,working_dir = working_dir )}``````{r}catalog_paths <-list("UHGP"=glue("{protein_catalogs}/UHGP_v2.0.1/uhgp-95/uhgp-95.faa"),"ELGG"=glue("{protein_catalogs}/ELGP_95.faa"),"KIJ"=glue("{protein_catalogs}/KIJ_CD-HIT-100_Proteins.faa"),"Hadza"=glue("{protein_catalogs}/hadza.fasta"),"RUMMETA"=glue("{protein_catalogs}/RGMGC.geneSet.faa"),"RUGS"=glue("{protein_catalogs}/cow-rumen-v1.0/protein_catalogue-95/protein_catalogue-95.faa"),"MGV"=glue("{protein_catalogs}/mgv_proteins.faa"),"VEuPathDB"=glue("{protein_catalogs}/VEuPathDB_v61.fasta"),"Wormbase"=glue("{protein_catalogs}/wormbase-v17.0.fasta"))``````{r, eval=FALSE}for (catalog innames(catalog_paths)) {print(catalog) output_dir <-glue("{protein_catalogs}/clustered_catalogs/{catalog}")shell_do(glue("mkdir -p {output_dir}"))if (file.exists(glue("{output_dir}/{catalog}-MMSeq2-95_rep_seq.fasta") )) {message(catalog, " file already exists ...")next }slurm_shell_do_mmseq2_clust(input_fasta = catalog_paths[[catalog]],output =glue("{output_dir}/{catalog}-MMSeq2-95") )}```Adding a unique catalog ID to the headers of all FASTAs```{r, eval=FALSE}clustered_rep_paths <-names(catalog_paths) %>% purrr::set_names() %>% purrr::map(~glue("{protein_catalogs}/clustered_catalogs/{.}/","{.}-MMSeq2-95_rep_seq.fasta" ))# check that files existif (!all(purrr::map(clustered_rep_paths, ~file.exists(.)) %>% unlist )) stop()clustered_named_rep_paths <-list()for (catalog innames(clustered_rep_paths)) { clustered_named_rep_paths[[catalog]] <-glue("{protein_catalogs}/clustered_catalogs/{catalog}/","{catalog}-MMSeq2-95_rep_seq_renamed.fasta" )if (file.exists(clustered_named_rep_paths[[catalog]])) {message(catalog, " file already exists ...")next }message("Annotating: ", catalog)slurm_shell_do(glue("/home/jboktor/bbmap/bbrename.sh"," in={clustered_rep_paths[[catalog]]}"," out={clustered_named_rep_paths[[catalog]]}"," prefix=CATID_{catalog}"," addprefix=t"," fixjunk=t"," -Xmx100g" ),memory ="100G", walltime =36000 )Sys.sleep(2)}```Collecting summary metrics for each clustered catalog```{r, eval=FALSE}clust_catalog_protein_n <-list()for (catalog innames(clustered_rep_paths)) {message("Counting proteins in: ", catalog) clust_catalog_protein_n[[catalog]] <-shell_do(glue("grep -c '^>' {clustered_rep_paths[[catalog]]}"),stdout_path =TRUE ) %>% as.numeric}```Concatenating clustered FASTAs```{r, eval=FALSE}shell_do(glue("mkdir -p {protein_catalogs}/clustered_catalogs/merged"))# check that files existif (!all(purrr::map( clustered_named_rep_paths, ~file.exists(.) ) %>% unlist )) stop()for (catalog_path in clustered_named_rep_paths) {message("Merging", catalog_path)# pause execution until previous job is completecheck_slurm_overload(njobs =1, interval =5)slurm_shell_do(glue("cat {catalog_path} >> ","{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_concatenated-catalog.fasta" ), memory ="50G")}```MMSeq2 similarity clustering of concatenated catalogs```{r, eval=FALSE}# Clustering at 95% (kmer for seq 200)slurm_shell_do_mmseq2_clust(input_fasta =glue("{protein_catalogs}/clustered_catalogs/","merged/2023-02-13_concatenated-catalog.fasta" ),output =glue("{protein_catalogs}/clustered_catalogs/","merged/2023-02-13_catalog_MMSeq2-95" ),memory ="128G",ncpus =8)# 90% Identity (kmer for seq 200)slurm_shell_do_mmseq2_clust(input_fasta =glue("{protein_catalogs}/clustered_catalogs/","merged/2023-02-13_concatenated-catalog.fasta" ),output =glue("{protein_catalogs}/clustered_catalogs/","merged/2023-02-13_catalog_MMSeq2-90" ),id_threshold =0.9,memory ="10G",ncpus =8)# 50% Identity (kmer for seq 80)slurm_shell_do_mmseq2_clust(input_fasta =glue("{protein_catalogs}/clustered_catalogs/","merged/2023-02-13_concatenated-catalog.fasta" ),output =glue("{protein_catalogs}/clustered_catalogs/","merged/2023-02-13_catalog_MMSeq2-50" ),walltime =270000,id_threshold =0.5,memory ="250G",ncpus =4)```Chunking protein catalog into 50 bins```{r, eval=FALSE}#' function to divide original fasta into N smaller fasta files, roughly even in size#' using bash awkget_awk_chunk_cmd <-function(fasta_path, nbins, output_dir){ filename <- fs::path_ext_remove(basename(fasta_path)) protein_n <-shell_do(glue("grep -c '^>' {fasta_path}"), stdout_path =TRUE ) %>%as.numeric() proteins_per_file <-ceiling(protein_n/nbins) chunk_cmd <-glue("awk 'BEGIN {n=0;} /^>/ ","{if(n%{{proteins_per_file}==0)","{file=sprintf(\"{{output_dir}/{{filename}_chunk_%d.fasta\",n);} ","print >> file; n++; next;} { print >> file; }' ","< {{fasta_path}", .open ="{{")return(chunk_cmd)}``````{r, eval=FALSE}slurm_shell_do(cmd =get_awk_chunk_cmd(fasta_path =glue("{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_catalog_MMSeq2-95_rep_seq.fasta"),output_dir =glue("{protein_catalogs}/clustered_catalogs/merged/chunked_fasta"),nbins =50 ),jobname =glue("chunk-catalog-{rand_string()}"), working_dir = wkdir, memory ="100G", ncpus =1, walltime =360000)```Calculating the number of unique proteins in the curated catalog at each clustering threshold```{r, eval=FALSE}catalog_clustered_rep_paths <-list.files(glue("{protein_catalogs}/clustered_catalogs/merged"), pattern ="rep_seq.fasta",full.names =TRUE)repseq_protein_n <-list()for (catalog in catalog_clustered_rep_paths) {message("Counting proteins in: \n", catalog) repseq_protein_n[[catalog]] <-shell_do(glue("grep -c '^>' {catalog}"),stdout_path =TRUE ) %>% as.numeric repseq_protein_n[[catalog]] %>%print()}``````{r}catalog_n_data <-tribble(~`Clustering Threshold`, ~`# of Proteins`,"50%", 58025475,"90%", 133902480,"95%", 164949326,)catalog_n_data %>%kable(format.args =list(big.mark =",")) %>%kable_styling(font_size =15,bootstrap_options =c("striped", "hover", "condensed"),full_width = F )```Gathering representative sequences from 50% clustered catalog```{r, eval=FALSE}shell_do(glue("cd {protein_catalogs}/clustered_catalogs/merged &&"" grep -o '^>.*' 2023-02-13_catalog_MMSeq2-50_rep_seq.fasta >"," 2023-02-13_catalog_MMSeq2-50_rep_seq_headers.txt" ))``````{r, eval=FALSE}# Read in txt file with 50% clustered catalog headerscluster_50_headers <-read.delim(glue("{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_catalog_MMSeq2-50_rep_seq_headers.txt" ),stringsAsFactors = F, header = F)set.seed(42)cluster_50_headers_downsample <- cluster_50_headers %>%as.data.table() %>%slice_sample(n =500000) %>%mutate(V1 =gsub(">", "", V1) %>%str_trim(side ="right")) %>%pull(V1)# save downsampled fastaextract_seq_from_catalog(fasta_header_list = cluster_50_headers_downsample,output_fasta_path =glue("{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_catalog_MMSeq2-50_500k-subsample.fasta" ),catalog_path =glue("{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_catalog_MMSeq2-50_rep_seq.fasta" ))subsamp550k <- protr::readFASTA(glue("{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_catalog_MMSeq2-50_500k-subsample.fasta" ))saveRDS( subsamp550k,glue("{wkdir}/data/interim/protein_catalog/{Sys.Date()}_custom_catalog_500k.rds"))```## Calculating Molecular Descriptors for Protein CatalogCalculating peptide lengths and filtering peptides shorter than 7 amino acids```{r, eval=FALSE}subsamp500k <-readRDS(glue("{wkdir}/data/interim/protein_catalog/","2023-04-24_custom_catalog_500k.rds" ))seq_lengths <- subsamp500k %>%nchar() %>%as.list()# select proteins that are at least 7 amino acids longfiltered_seq_list <- seq_lengths %>%keep(. >6)filtered_subsample_seqs <- subsamp500k %>%keep(names(.) %in%names(filtered_seq_list))set.seed(42)random_catalog_balanced_headers_n100 <-data.frame("fasta_headers"=names(filtered_subsample_seqs)) %>%mutate(catalog = strex::str_before_nth(fasta_headers, "_", 2) %>%gsub("CATID_", "", .)) %>%group_by(catalog) %>%slice_sample(n =100) %>%pull(fasta_headers)saveRDS( random_catalog_balanced_headers_n100,glue("{wkdir}/data/interim/protein_catalog/random_catalog_balanced_headers_n100.rds"))seq_lengths_df <-as.data.frame(seq_lengths) %>%pivot_longer(everything(), names_to ="id", values_to ="protein_length") %>%distinct()saveRDS( seq_lengths_df,glue("{wkdir}/data/interim/protein_catalog/seq_lengths_df.rds"))```Visualizing histogram of sequence lengths```{r, eval=FALSE}seq_lengths_df <-readRDS(glue("{wkdir}/data/interim/protein_catalog/seq_lengths_df.rds"))p_cat50_seqhist <- seq_lengths_df %>%ggplot(aes(x = protein_length)) +scale_x_log10() +geom_histogram(bins =250, fill ="lightblue") +labs(x ="Protein Length", y ="Count") +theme_bw()ggsave(glue("{wkdir}/figures/protein_catalog/sequence-length-hist.png"), p_cat50_seqhist,width =6, height =5, dpi =300)```Calculating Normalized Moreau-Broto autocorrelation descriptors ```{r}protr_descriptors <-tribble(~Accession, ~Molecular_Descriptor,"CIDH920105", "Normalized Average Hydrophobicity Scales","BHAR880101", "Average Flexibility Indices","CHAM820101", "Polarizability Parameter","CHAM820102", "Free Energy of Solution in Water, kcal/mole","CHOC760101", "Residue Accessible Surface Area in Tripeptide","BIGC670101", "Residue Volume","CHAM810101", "Steric Parameter","DAYM780201", "Relative Mutability")protr_descriptors %>%kable() %>%kable_styling(font_size =15,bootstrap_options =c("striped", "hover", "condensed"),full_width = F )``````{r, eval=FALSE}library(protr)possiblyextractMoreauBroto <-possibly(.f = extractMoreauBroto, otherwise =NULL)possiblyextractDescScales <-possibly(.f = extractDescScales, otherwise =NULL)random_catalog_balanced_headers_n100 <-readRDS(glue("{wkdir}/data/interim/protein_catalog/random_catalog_balanced_headers_n100.rds"))future::plan(multisession)tic()proteochemical_descriptors <- filtered_subsample_seqs[random_catalog_balanced_headers_n100] %>% furrr::future_map(~possiblyextractMoreauBroto(., nlag =7)) %>%bind_rows(.id ="id")toc()saveRDS( proteochemical_descriptors,glue("{wkdir}/data/interim/protein_catalog/extractMoreauBroto.rds"))future::plan(multisession)tic()molPropDescriptors <- filtered_subsample_seqs[random_catalog_balanced_headers_n100] %>% furrr::future_map(~possiblyextractDescScales( ., propmat ="AAMolProp", pc =3, lag =7, silent =TRUE) ) %>%bind_rows(.id ="id")toc()saveRDS( molPropDescriptors,glue("{wkdir}/data/interim/protein_catalog/molPropDescriptors.rds"))```Visualizing autocorrelation descriptors ```{r, eval=FALSE}MoreauBrotoDescriptors <-readRDS(glue("{wkdir}/data/interim/protein_catalog/extractMoreauBroto.rds"))pca_df <- MoreauBrotoDescriptors %>%distinct() %>%mutate(catalog = strex::str_before_nth(id, "_", 2) %>%gsub("CATID_", "", .))pca_df %>%plot_ly(x =~CHOC760101.lag7,y =~CHAM820101.lag7,z =~DAYM780201.lag7,text =~paste("ID:", id),mode ="markers", marker =list(size =6) ) %>%add_markers(color =~ catalog,colors =pal_npg(palette =c("nrc"), alpha =1)(9) ) %>%layout(scene =list(xaxis =list(title ="Residue Accessible Surface Area in Tripeptide"),yaxis =list(title ="Polarizability Parameter"),zaxis =list(title ="Relative Mutability") ))```Visualizing Molecular Property descriptors ```{r}molPropDescriptors <-readRDS(glue("{wkdir}/data/interim/protein_catalog/molPropDescriptors.rds"))pca_df <- molPropDescriptors %>%distinct() %>%mutate(catalog = strex::str_before_nth(id, "_", 2) %>%gsub("CATID_", "", .))pca_df %>%plot_ly(x =~scl1.lag7,y =~scl2.lag7,z =~scl3.lag7,text =~paste("ID:", id),mode ="markers", marker =list(size =6) ) %>%add_markers(color =~ catalog,colors =pal_npg(palette =c("nrc"), alpha =1)(9) ) %>%layout(scene =list(xaxis =list(title ="Scale 1 - Lag 7"),yaxis =list(title ="Scale 2 - Lag 7"),zaxis =list(title ="Scale 3 - Lag 7") ))```## Catalog Comparisons to UniRef90 and MGnify90Downloading MGnify90 and UniRef90 for comparison```{r, eval=FALSE}wget_download_slurm(jobname ="mgy_clusters_fasta",slurm_out ="/central/scratch/jbok/slurmdump/",walltime ="4-00:00:00",download_link ="http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/mgy_clusters.fa.gz",output_dir ="/central/groups/MazmanianLab/joeB/Downloads/RefDBs/MGnify_2023_02/")wget_download_slurm(jobname ="uniref90_fasta",slurm_out ="/central/scratch/jbok/slurmdump/",walltime ="4-00:00:00",download_link ="https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz",output_dir ="/central/groups/MazmanianLab/joeB/Downloads/RefDBs/UniRef/")```Wrangling UniRef90 and MGnify90```{r, eval=FALSE}refdb_dir <-"/central/groups/MazmanianLab/joeB/Downloads/RefDBs"scratch_dir <-"/central/scratch/jbok/mim_temp/clustering"shell_do(glue("mkdir -p {scratch_dir}"))custom_90p_path <-glue("{protein_catalogs}/clustered_catalogs/merged/","2023-02-13_catalog_MMSeq2-90_rep_seq.fasta" )# copy gz fasta files into scratch dirslurm_shell_do(glue("cp {refdb_dir}/MGnify_2023_02/mgy_clusters.fa.gz {scratch_dir}/"), memory ="10G" )slurm_shell_do(glue("cp {refdb_dir}/UniRef/uniref90.fasta.gz {scratch_dir}/"), memory ="10G" )# unzip fasta files && renameslurm_shell_do(glue("cd {scratch_dir} &&"," gunzip mgy_clusters.fa.gz &&"," mv mgy_clusters.fa MGnify90_custom90_merged.fasta"), memory ="10G" )slurm_shell_do(glue("cd {scratch_dir} &&"," gunzip uniref90.fasta.gz &&"," mv uniref90.fasta UniRef90_custom90_merged.fasta"), memory ="10G" )# Concatenate CUSTOM 90% Catalog with UniRef90 and MGnify90slurm_shell_do(glue("cat {custom_90p_path} >> {scratch_dir}/UniRef90_custom90_merged.fasta"), memory ="10G" )slurm_shell_do(glue("cat {custom_90p_path} >> {scratch_dir}/MGnify90_custom90_merged.fasta"), memory ="10G" )```Clustering UniRef90 and MGnify90 + Custom catalog with MMSeq2```{r, eval=FALSE}# 90% Identity (kmer for seq 80) UniRef90slurm_shell_do_mmseq2_clust(input_fasta =glue("{scratch_dir}/UniRef90_custom90_merged.fasta"),output =glue("{scratch_dir}/2023-04-17_MMSeq2-90_UniRef90_custom90"),walltime =270000,id_threshold =0.9,memory ="250G",ncpus =4)# 90% Identity (kmer for seq 80) MGnify90slurm_shell_do_mmseq2_clust(input_fasta =glue("{scratch_dir}/MGnify90_custom90_merged.fasta"),output =glue("{scratch_dir}/2023-04-17_MMSeq2-90_MGnify90_custom90"),walltime =270000,id_threshold =0.9,memory ="300G",ncpus =4)```## ResultsAt 95% similarity, our catalog contains 164,949,326 unique protein sequences. This compendium represents the global human and ruminant gut microbiome genomic repertoire, in addition to common eukaryotic pathogens.```{r}sessionInfo()```